Linux Foundation and sparklyr

Javier Luraschi, RStudio

Overview

About RStudio

RStudio’s Multiverse Team

Authors of R packages to support Apache Spark, TensorFlow and MLflow.

Multiverse Timeline

The Multiverse team focuses on bringing relevant machine learning technologies to R users to empower and simplify data science workflows.

What is Spark?

“Apache Spark™ is a unified analytics engine for large-scale data processing.”

  • Unified: Spark supports many libraries, cluster technologies and storage systems.
  • Analytics: Analytics is the discovery and interpretation of data to produce and communicate information.
  • Engine: Spark is expected to be efficient and generic.
  • Large-Scale: One can interpret large-scale as cluster-scale, meaning a set of connected computers working together.

Why Spark?

Information grows at exponential rates.

What’s next?

We see Spark supporting multiple projects: TensorFlow, MLflow, Tuning, etc.

Why R?

Modern R

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Spark and R

In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
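
As a minimal sketch of what "like magic" can look like in practice, the R snippet below runs ordinary dplyr verbs against a Spark table through sparklyr. It assumes a local Spark installation is available (for example, one set up with `sparklyr::spark_install()`); the table name `mtcars_spark` and the variable names are illustrative choices, not taken from the talk.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via spark_install().
sc <- spark_connect(master = "local")

# Copy a small built-in R data frame into Spark.
# "mtcars_spark" is a hypothetical table name chosen for this sketch.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Regular dplyr verbs; sparklyr translates them to Spark SQL,
# and collect() brings the aggregated result back into R.
avg_mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(avg_mpg_by_cyl)

spark_disconnect(sc)
```

The same pipeline would run unchanged on a local data frame, which is the point of the slide: the dplyr code stays the same and only the data source changes.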

Timeline

2016-2019

From launch to sparklyr 1.0.

Beyond 2020

Aspirational direction beyond 2020.

Use Cases

Technical

Community

Thanks!